This is the markdownfile for my assignment 8 on cluster analysis. For this assignment I expanded on the code I wrote for the cluster analysis class project adding more methods for distance calculation and visualization. Once again, I used cancer gene expression and survival data to show clustering applications to analyze genetic data on two types brain cancers: Low Grade Glioma (Glioma) and Glioblastoma. Only genes that had a p-value < 0.05 relating gene expression to survival were included in the analysis. Shared genes significant to survival in both brain cancers was subsetted and then random samples of 50 and 100 genes were chosen to create the corresponding glioma and glioblastoma dataframes for clustering analysis. Methods used include hierarchiacal clustering using Average Linkage and Ward(ANOVA) and partition clustering using K-means and PAM. I also explored visualization using heatmaps with correlation-based distance measures including “pearson”, “kendall”, and “spearman” methods.
Gene expression data is based on The Cancer Genome Atlast (TCGA) found on the National Cancer Institute’s Genomic Data Portal (https://gdc-portal.nci.nih.gov/). Survival data was derived from Stanford’s Precog database (https://precog.stanford.edu/), and Oncolnc (http://www.oncolnc.org/). The data was reformatted to suit clusting analysis.
precogdata <- read_excel("/Users/louisecabansay/Downloads/Precog_MetaZ.xlsx")
gbm.oncolnc <- read_excel("/Users/louisecabansay/Documents/School Work/UCBX R Class/GBM_mrna.xlsx")
gli.oncolnc <- read_excel("/Users/louisecabansay/Documents/School Work/UCBX R Class/LowerGradeGlioma.xlsx")
gbm.precog <- data.frame(precogdata$Gene, precogdata$Name, precogdata$`Unweighted_meta-Z_of_all_cancers`, precogdata$Brain_cancer_Glioblastoma)
gli.precog <- data.frame(precogdata$Gene, precogdata$Name, precogdata$`Unweighted_meta-Z_of_all_cancers`, precogdata$Brain_cancer_Glioma)
#examine data
names(gbm.oncolnc)[1] <- "GeneName"
names(gli.oncolnc)[1] <- "GeneName"
names(gbm.precog)[1] <-"GeneName"
names(gli.precog)[1] <-"GeneName"
str(gbm.precog)
## 'data.frame': 23287 obs. of 4 variables:
## $ GeneName : Factor w/ 23287 levels "42795","42796",..: 1570 5930 21090 21165 2212 14283 2585 2559 2481 21681 ...
## $ precogdata.Name : Factor w/ 19740 levels "1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase, alpha)",..: 1343 5333 17355 17380 3178 11431 2319 2295 3244 17255 ...
## $ precogdata..Unweighted_meta.Z_of_all_cancers.: num 12.2 12.1 11.9 11.7 11.4 ...
## $ precogdata.Brain_cancer_Glioblastoma : num 0.296 -0.359 -0.124 -0.742 -0.427 ...
str(gbm.oncolnc)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16840 obs. of 6 variables:
## $ GeneName : chr "PTPRN" "ZIC3" "PTPRN2" "OS9" ...
## $ Cox coefficient : num 0.523 -0.382 0.377 0.394 0.38 ...
## $ Raw p-value : num 2.6e-06 4.2e-05 9.2e-05 9.5e-05 1.0e-04 2.6e-04 3.4e-04 3.6e-04 3.7e-04 3.9e-04 ...
## $ BH-adjusted p-value: num 0.0438 0.3368 0.3368 0.3368 0.3368 ...
## $ Median Expression : num 485.8 20.3 1351.7 6624.2 101.9 ...
## $ Mean Expression : num 807 29.2 1547.9 14460.4 168 ...
str(gli.oncolnc)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16822 obs. of 6 variables:
## $ GeneName : chr "TGIF1" "CRTAC1" "CEP112" "ZDHHC22" ...
## $ Cox_coefficient : num 0.876 -0.842 0.802 -0.802 -0.77 ...
## $ Raw_p-value : num 2.2e-16 7.8e-16 3.9e-15 1.9e-14 2.6e-14 ...
## $ BH-adjusted_p-value: num 3.70e-12 6.56e-12 2.19e-11 7.99e-11 8.75e-11 ...
## $ Median_Expression : num 253.5 1030.5 67.2 2556.8 21002 ...
## $ Mean_Expression : num 348.9 1430 97.3 3083.7 22508.3 ...
describe(gbm.precog)
## vars n mean sd
## GeneName* 1 23287 11644.00 6722.52
## precogdata.Name* 2 23261 9054.87 5650.82
## precogdata..Unweighted_meta.Z_of_all_cancers. 3 23287 0.10 2.16
## precogdata.Brain_cancer_Glioblastoma 4 23287 -0.03 1.04
## median trimmed mad
## GeneName* 11644.00 11644.00 8631.70
## precogdata.Name* 8525.00 8851.28 7512.33
## precogdata..Unweighted_meta.Z_of_all_cancers. 0.05 0.05 1.34
## precogdata.Brain_cancer_Glioblastoma 0.00 -0.04 0.75
## min max range
## GeneName* 1.00 23287.00 23286.00
## precogdata.Name* 1.00 19740.00 19739.00
## precogdata..Unweighted_meta.Z_of_all_cancers. -10.66 12.20 22.86
## precogdata.Brain_cancer_Glioblastoma -5.40 4.69 10.09
## skew kurtosis se
## GeneName* 0.00 -1.20 44.05
## precogdata.Name* 0.26 -1.19 37.05
## precogdata..Unweighted_meta.Z_of_all_cancers. 0.41 2.79 0.01
## precogdata.Brain_cancer_Glioblastoma 0.00 1.35 0.01
describe(gbm.oncolnc)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min
## GeneName* 1 16840 NaN NA NA NaN NA Inf
## Cox coefficient 2 16840 0.02 0.10 0.02 0.02 0.10 -0.38
## Raw p-value 3 16840 0.47 0.29 0.46 0.47 0.39 0.00
## BH-adjusted p-value 4 16840 0.90 0.08 0.91 0.90 0.08 0.04
## Median Expression 5 16840 994.94 3632.94 383.02 543.55 527.58 1.00
## Mean Expression 6 16840 1127.28 4233.28 436.56 609.75 576.47 1.21
## max range skew kurtosis se
## GeneName* -Inf -Inf NA NA NA
## Cox coefficient 0.52 0.91 0.09 0.02 0.00
## Raw p-value 1.00 1.00 0.10 -1.21 0.00
## BH-adjusted p-value 1.00 0.96 -1.02 2.14 0.00
## Median Expression 264903.35 264902.35 35.54 2058.42 28.00
## Mean Expression 309535.63 309534.43 35.76 2078.26 32.62
describe(gli.oncolnc)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min
## GeneName* 1 16822 NaN NA NA NaN NA Inf
## Cox_coefficient 2 16822 0.03 0.28 0.03 0.03 0.30 -0.84
## Raw_p-value 3 16822 0.19 0.28 0.03 0.14 0.05 0.00
## BH-adjusted_p-value 4 16822 0.23 0.30 0.06 0.18 0.09 0.00
## Median_Expression 5 16822 997.63 3721.54 357.67 541.35 500.10 1.00
## Mean_Expression 6 16822 1114.84 4513.27 416.07 608.95 562.08 1.26
## max range skew kurtosis se
## GeneName* -Inf -Inf NA NA NA
## Cox_coefficient 0.88 1.72 -0.07 -0.53 0.00
## Raw_p-value 1.00 1.00 1.41 0.74 0.00
## BH-adjusted_p-value 1.00 1.00 1.14 -0.07 0.00
## Median_Expression 339528.01 339527.01 50.55 4188.38 28.69
## Mean_Expression 436000.93 435999.67 58.65 5246.81 34.80
Combine different datasets from databases. Subset to contain only genes with significant p-values
gbm.df <- merge(gbm.precog, gbm.oncolnc, by = "GeneName")
gli.df <- merge(gli.precog, gli.oncolnc, by = "GeneName")
head(gbm.df)
## GeneName precogdata.Name
## 1 A1BG Alpha-1-B glycoprotein
## 2 A2M Alpha-2-macroglobulin
## 3 A2ML1 Alpha-2-macroglobulin-like 1
## 4 A4GALT Alpha 1,4-galactosyltransferase
## 5 AAAS Achalasia, adrenocortical insufficiency, alacrimia
## 6 AACS Acetoacetyl-CoA synthetase
## precogdata..Unweighted_meta.Z_of_all_cancers.
## 1 2.185478
## 2 -3.867906
## 3 1.682499
## 4 -1.959646
## 5 2.034914
## 6 3.167680
## precogdata.Brain_cancer_Glioblastoma Cox coefficient Raw p-value
## 1 0.559 -0.04370 0.6200
## 2 0.931 0.09770 0.3200
## 3 2.327 -0.03230 0.7200
## 4 0.491 0.06420 0.4800
## 5 1.500 -0.00354 0.9700
## 6 0.755 0.26010 0.0081
## BH-adjusted p-value Median Expression Mean Expression
## 1 0.9463247 162.3237 222.3394
## 2 0.8832650 18716.6609 22187.0128
## 3 0.9638156 43.9443 100.6319
## 4 0.9180239 82.3080 109.3797
## 5 0.9931177 830.5955 863.0091
## 6 0.6987552 450.6009 467.0564
head(gli.df)
## GeneName precogdata.Name
## 1 A1BG Alpha-1-B glycoprotein
## 2 A2M Alpha-2-macroglobulin
## 3 A2ML1 Alpha-2-macroglobulin-like 1
## 4 A4GALT Alpha 1,4-galactosyltransferase
## 5 AAAS Achalasia, adrenocortical insufficiency, alacrimia
## 6 AACS Acetoacetyl-CoA synthetase
## precogdata..Unweighted_meta.Z_of_all_cancers.
## 1 2.185478
## 2 -3.867906
## 3 1.682499
## 4 -1.959646
## 5 2.034914
## 6 3.167680
## precogdata.Brain_cancer_Glioma Cox_coefficient Raw_p-value
## 1 2.768 0.3334 0.00042
## 2 3.145 0.2621 0.00540
## 3 0.266 0.1944 0.03400
## 4 2.310 0.0716 0.43000
## 5 -0.505 -0.1889 0.04900
## 6 -0.385 -0.0298 0.74000
## BH-adjusted_p-value Median_Expression Mean_Expression
## 1 0.001630565 81.26865 99.26091
## 2 0.014291819 10952.64130 13228.99171
## 3 0.066910154 124.46575 180.39007
## 4 0.531793854 81.99415 105.68930
## 5 0.091322624 753.00550 766.96885
## 6 0.803322148 394.17055 491.00150
names(gbm.df)[2:4] <- c("GeneFunction", "Unweighted_meta_Z_allcancers", "precog_Z")
names(gli.df)[2:4] <- c("GeneFunction", "Unweighted_meta_Z_allcancers", "precog_Z")
sig.gbm<-unique(gbm.df[gbm.df$`Raw p-value` <0.05,])
sig.gli<-(gli.df[gli.df$`Raw_p-value`<0.05,])
gli.genes<-unique(sig.gli[sig.gli$GeneName %in% sig.gbm$GeneName,])
gbm.genes<-unique(sig.gbm[sig.gbm$GeneName %in% gli.genes$GeneName,])
remove <- "4748"
gli.genes <- gli.genes[!rownames(gli.genes) %in% remove,]
standardize num variables to a mean of 0 with a standard deviation of 1 for comparison
gbm.scaled <- scale(gbm.genes[3:9])
gli.scaled <- scale(gli.genes[3:9])
glioblastoma<-cbind(gbm.genes[1:2], gbm.scaled)
glioma<-cbind(gli.genes[1:2], gli.scaled)
names(glioblastoma)[6:7] <-c("pvalue", "bh_pvalue")
names(glioma)[6:7] <-c("pvalue", "bh_pvalue")
describe(glioblastoma)
## vars n mean sd median trimmed
## GeneName* 1 512 12181.14 7721.54 13111.50 12227.34
## GeneFunction* 2 512 10317.76 6227.18 10402.00 10346.91
## Unweighted_meta_Z_allcancers 3 512 0.00 1.00 0.01 -0.02
## precog_Z 4 512 0.00 1.00 -0.03 0.01
## Cox coefficient 5 512 0.00 1.00 0.51 0.08
## pvalue 6 512 0.00 1.00 0.07 0.02
## bh_pvalue 7 512 0.00 1.00 0.40 0.24
## Median Expression 8 512 0.00 1.00 -0.21 -0.16
## Mean Expression 9 512 0.00 1.00 -0.19 -0.15
## mad min max range skew
## GeneName* 10734.02 76.00 23265.00 23189.00 -0.03
## GeneFunction* 8478.25 187.00 19722.00 19535.00 -0.02
## Unweighted_meta_Z_allcancers 0.86 -3.47 4.05 7.52 0.32
## precog_Z 1.01 -3.31 2.59 5.90 -0.13
## Cox coefficient 0.28 -2.31 2.07 4.38 -0.79
## pvalue 1.25 -1.79 1.65 3.45 -0.18
## bh_pvalue 0.13 -9.76 0.61 10.36 -3.65
## Median Expression 0.09 -0.27 16.10 16.37 11.47
## Mean Expression 0.09 -0.26 14.60 14.86 11.79
## kurtosis se
## GeneName* -1.52 341.25
## GeneFunction* -1.40 275.21
## Unweighted_meta_Z_allcancers 1.25 0.04
## precog_Z 0.04 0.04
## Cox coefficient -1.14 0.04
## pvalue -1.16 0.04
## bh_pvalue 21.58 0.04
## Median Expression 158.63 0.04
## Mean Expression 158.32 0.04
describe(glioma)
## vars n mean sd median trimmed
## GeneName* 1 512 12181.14 7721.54 13111.50 12227.34
## GeneFunction* 2 512 10317.76 6227.18 10402.00 10346.91
## Unweighted_meta_Z_allcancers 3 512 0.00 1.00 0.01 -0.02
## precog_Z 4 512 0.00 1.00 -0.04 0.02
## Cox_coefficient 5 512 0.00 1.00 0.44 0.05
## pvalue 6 512 0.00 1.00 -0.53 -0.25
## bh_pvalue 7 512 0.00 1.00 -0.54 -0.23
## Median_Expression 8 512 0.00 1.00 -0.24 -0.17
## Mean_Expression 9 512 0.00 1.00 -0.25 -0.17
## mad min max range skew
## GeneName* 10734.02 76.00 23265.00 23189.00 -0.03
## GeneFunction* 8478.25 187.00 19722.00 19535.00 -0.02
## Unweighted_meta_Z_allcancers 0.86 -3.47 4.05 7.52 0.32
## precog_Z 1.15 -2.48 2.30 4.78 -0.18
## Cox_coefficient 0.86 -2.17 1.58 3.75 -0.46
## pvalue 0.07 -0.58 3.73 4.31 2.01
## bh_pvalue 0.14 -0.63 3.46 4.09 1.78
## Median_Expression 0.12 -0.32 18.31 18.64 12.89
## Mean_Expression 0.13 -0.35 17.92 18.26 12.18
## kurtosis se
## GeneName* -1.52 341.25
## GeneFunction* -1.40 275.21
## Unweighted_meta_Z_allcancers 1.25 0.04
## precog_Z -0.68 0.04
## Cox_coefficient -1.31 0.04
## pvalue 3.17 0.04
## bh_pvalue 2.19 0.04
## Median_Expression 220.29 0.04
## Mean_Expression 201.85 0.04
50 genes are randomly sampled from master list of glioblastoma genes (because it is the smaller set). The same 50 genes are used to cluster both glioma and glioblastoma. Sets are saved to csv file to reproduce results. In this markdownfile the actual sampling code is commented out and a previously generated set is loaded.
#Average-Linkage Clustering
#w/50 randomly sampled genes
#r50.glioblastoma <- glioblastoma[sample(1:nrow(glioblastoma), 50, replace=FALSE),]
#r50.glioma <- glioma[glioma$GeneName %in% r50.glioblastoma$GeneName,]
#save set to CSV to replicate later
#write.csv(r50.glioblastoma, "r50glioblastoma.csv")
#write.csv(r50.glioma, "r50glioma.csv")
r50.glioblastoma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r50glioblastoma.csv")
r50.glioma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r50glioma.csv")
row.names(r50.glioblastoma) <- r50.glioblastoma$GeneName
row.names(r50.glioma) <- r50.glioma$GeneName
opar <- par(no.readonly=TRUE) #save current state
par(mfrow=c(2,1)) #layout dendogram hierarchy plots 2 rows, 1 column
#make glioblastoma cluster dendogram
nc.r50.glioblastoma <- r50.glioblastoma[3:9]
d.r50.glioblastoma <- dist(nc.r50.glioblastoma)
fit.gbm.average50 <- hclust(d.r50.glioblastoma, method="average")
plot(fit.gbm.average50, hang=-1, cex=.8, main="50 gene Average Linkage Clustering for Glioblastoma")
#make glioma cluster dendogram
nc.r50.glioma <- r50.glioma[3:9]
d.r50.glioma <- dist(nc.r50.glioma)
fit.gli.average50 <- hclust(d.r50.glioma, method="average")
plot(fit.gli.average50, hang=-1, cex=.8, main="50 gene Average Linkage Clustering for Glioma")
par(opar)
Shows best number of clusters to cluster each dataset using nbclust package. 5 clusters chosen for both cancers for comparability
nc.gli50 <- NbClust(nc.r50.glioma, distance="euclidean",
min.nc=2, max.nc=10, method="average")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 14 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
table(nc.gli50$Best.n[1,])
##
## 0 1 2 3 4 6 7 10
## 2 1 2 14 1 2 2 2
nc.gbm50 <- NbClust(nc.r50.glioblastoma, distance="euclidean",
min.nc=2, max.nc=10, method="average")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 4 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
table(nc.gbm50$Best.n[1,])
##
## 0 1 2 3 4 5 6 7 9 10
## 2 1 6 2 2 3 2 4 1 3
par(opar)
par(mfrow=c(1,2)) #layout cluster bar plots next to each other plots 1rows, 2column
barplot(table(nc.gli50$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of GLI Clusters Chosen by 7 Criteria")
barplot(table(nc.gbm50$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of GBM Clusters Chosen by 7 Criteria")
par(opar)
gli.50.clusters <- cutree(fit.gli.average50, k=5)
table(gli.50.clusters) #makes table showing how many genes in each cluster
## gli.50.clusters
## 1 2 3 4 5
## 5 12 27 4 2
aggregate(nc.r50.glioma, by=list(cluster=gli.50.clusters), median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox_coefficient
## 1 1 -0.15391482 0.1825930 0.3300947
## 2 2 -0.80796876 -1.3586532 -1.4624680
## 3 3 -0.02566294 0.6188862 0.8056032
## 4 4 2.26751030 1.4762832 0.8973953
## 5 5 -1.32538543 -0.6004725 -0.2349430
## pvalue bh_pvalue Median_Expression Mean_Expression
## 1 0.9929519 1.1056295 -0.2763971 -0.3037636
## 2 -0.5782126 -0.6334471 -0.2689102 -0.2837911
## 3 -0.5743581 -0.6230121 -0.2907465 -0.3006326
## 4 -0.5741704 -0.6237840 -0.2687147 -0.2762618
## 5 2.9573277 2.8242766 -0.3024922 -0.3267051
aggregate(as.data.frame(nc.r50.glioma), by=list(cluster=gli.50.clusters),
median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox_coefficient
## 1 1 -0.15391482 0.1825930 0.3300947
## 2 2 -0.80796876 -1.3586532 -1.4624680
## 3 3 -0.02566294 0.6188862 0.8056032
## 4 4 2.26751030 1.4762832 0.8973953
## 5 5 -1.32538543 -0.6004725 -0.2349430
## pvalue bh_pvalue Median_Expression Mean_Expression
## 1 0.9929519 1.1056295 -0.2763971 -0.3037636
## 2 -0.5782126 -0.6334471 -0.2689102 -0.2837911
## 3 -0.5743581 -0.6230121 -0.2907465 -0.3006326
## 4 -0.5741704 -0.6237840 -0.2687147 -0.2762618
## 5 2.9573277 2.8242766 -0.3024922 -0.3267051
#plot clusters w/ red boxes indicating clusters
plot(fit.gli.average50, hang=-1, cex=.8,
main="GLI Average Linkage Clustering\n5 Cluster Solution")
#overlay red rectangle on cluster
rect.hclust(fit.gli.average50, k=5)
par(opar)
gbm.50.clusters <- cutree(fit.gbm.average50, k=5)
table(gbm.50.clusters)
## gbm.50.clusters
## 1 2 3 4 5
## 7 27 12 3 1
aggregate(nc.r50.glioblastoma, by=list(cluster=gbm.50.clusters), median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox.coefficient
## 1 1 -0.6379634 -0.25487416 0.8907267
## 2 2 -0.5070486 -0.27490099 0.4328913
## 3 3 0.8711513 1.14736188 0.5451723
## 4 4 0.3226721 -0.27490099 -1.8330551
## 5 5 -2.1683410 -0.08750704 1.4187377
## pvalue bh_pvalue Median.Expression Mean.Expression
## 1 -1.4353588 -1.1645821 -0.2490808 -0.2217467
## 2 1.0214356 0.4871126 -0.2508916 -0.2365749
## 3 0.3174831 0.4012574 -0.2024280 -0.1849876
## 4 -1.7028608 -2.6239851 -0.2359394 -0.2260377
## 5 -1.7662165 -2.7880942 -0.2607908 -0.2423820
aggregate(as.data.frame(nc.r50.glioblastoma), by=list(cluster=gbm.50.clusters),
median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox.coefficient
## 1 1 -0.6379634 -0.25487416 0.8907267
## 2 2 -0.5070486 -0.27490099 0.4328913
## 3 3 0.8711513 1.14736188 0.5451723
## 4 4 0.3226721 -0.27490099 -1.8330551
## 5 5 -2.1683410 -0.08750704 1.4187377
## pvalue bh_pvalue Median.Expression Mean.Expression
## 1 -1.4353588 -1.1645821 -0.2490808 -0.2217467
## 2 1.0214356 0.4871126 -0.2508916 -0.2365749
## 3 0.3174831 0.4012574 -0.2024280 -0.1849876
## 4 -1.7028608 -2.6239851 -0.2359394 -0.2260377
## 5 -1.7662165 -2.7880942 -0.2607908 -0.2423820
plot(fit.gbm.average50, hang=-1, cex=.8,
main="GBM Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.gbm.average50, k=5)
par(opar)
gli.50.clist <- lapply(sort(unique(gli.50.clusters)), function(x) r50.glioma[which(gli.50.clusters==x),])
gli.cluster.50.1 <- cbind(gli.50.clist[[1]], Cluster = "Gli.Cluster1")
gli.cluster.50.2 <- cbind(gli.50.clist[[2]], Cluster = "Gli.Cluster2")
gli.cluster.50.3 <- cbind(gli.50.clist[[3]], Cluster = "Gli.Cluster3")
gli.cluster.50.4 <- cbind(gli.50.clist[[4]], Cluster = "Gli.Cluster4")
gli.cluster.50.5 <- cbind(gli.50.clist[[5]], Cluster = "Gli.Cluster5")
#gli.cluster.50.6 <- cbind(gli.50.clist[[6]], Cluster = "Gli.Cluster6")
#gli.cluster.50.7 <- cbind(gli.50.clist[[7]], Cluster = "Gli.Cluster7")
#gli.cluster.50.8 <- cbind(gli.50.clist[[8]], Cluster = "Gli.Cluster8")
glioma_50_Genes_C <- rbind(gli.cluster.50.1, gli.cluster.50.2, gli.cluster.50.3, gli.cluster.50.4,
gli.cluster.50.5)# gli.cluster.50.6,gli.cluster.50.7, gli.cluster.50.8)
#save table to csv
write.csv(glioma_50_Genes_C, "Table_Glioma_50genes_5Clust.csv")
glioma_50_Genes_C[,-c(1,3:9)]
## GeneFunction
## AK7 Adenylate kinase 7
## CUX1 Cut-like homeobox 1
## DKFZP586I1420 Hypothetical protein DKFZp586I1420
## FSTL3 Follistatin-like 3 (secreted glycoprotein)
## SCN9A Sodium channel, voltage-gated, type IX, alpha subunit
## ANK1 Ankyrin 1, erythrocytic
## ANO4 Anoctamin 4
## CYP2E1 Cytochrome P450, family 2, subfamily E, polypeptide 1
## DLK2 Delta-like 2 homolog (Drosophila)
## EHD3 EH-domain containing 3
## GRHPR Glyoxylate reductase/hydroxypyruvate reductase
## HECW1 HECT, C2 and WW domain containing E3 ubiquitin protein ligase 1
## HMX1 H6 family homeobox 1
## RIIAD1 Data not found
## SGSM1 Small G protein signaling modulator 1
## SUGT1P3 Suppressor of G2 allele of SKP1 (S. cerevisiae) pseudogene 3
## WNT7B Wingless-type MMTV integration site family, member 7B
## AQP9 Aquaporin 9
## B3GALNT2 Beta-1,3-N-acetylgalactosaminyltransferase 2
## BIRC3 Baculoviral IAP repeat-containing 3
## CCDC69 Coiled-coil domain containing 69
## CPA2 Carboxypeptidase A2 (pancreatic)
## DENND2D DENN/MADD domain containing 2D
## DHX15 DEAH (Asp-Glu-Ala-His) box polypeptide 15
## DUSP5 Dual specificity phosphatase 5
## FAM50B Family with sequence similarity 50, member B
## FAM86B2 Family with sequence similarity 86, member B2
## FGR Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## G0S2 G0/G1switch 2
## GJB2 Gap junction protein, beta 2, 26kDa
## PHKB Phosphorylase kinase, beta
## PPARG Peroxisome proliferator-activated receptor gamma
## PRPF38A PRP38 pre-mRNA processing factor 38 (yeast) domain containing A
## PTRF Polymerase I and transcript release factor
## RASSF10 Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## S100PBP S100P binding protein
## SHISA5 Shisa homolog 5 (Xenopus laevis)
## VASN Vasorin
## ZMYM1 Zinc finger, MYM-type 1
## ZNF175 Zinc finger protein 175
## ZNF432 Zinc finger protein 432
## ZNF543 Zinc finger protein 543
## ZNF644 Zinc finger protein 644
## ZRANB2 Zinc finger, RAN-binding domain containing 2
## CKAP4 Cytoskeleton-associated protein 4
## OPN3 Opsin 3
## PTX3 Pentraxin 3, long
## SLC16A3 Solute carrier family 16, member 3 (monocarboxylic acid transporter 4)
## DAPK2 Death-associated protein kinase 2
## PIP5KL1 Phosphatidylinositol-4-phosphate 5-kinase-like 1
## Cluster
## AK7 Gli.Cluster1
## CUX1 Gli.Cluster1
## DKFZP586I1420 Gli.Cluster1
## FSTL3 Gli.Cluster1
## SCN9A Gli.Cluster1
## ANK1 Gli.Cluster2
## ANO4 Gli.Cluster2
## CYP2E1 Gli.Cluster2
## DLK2 Gli.Cluster2
## EHD3 Gli.Cluster2
## GRHPR Gli.Cluster2
## HECW1 Gli.Cluster2
## HMX1 Gli.Cluster2
## RIIAD1 Gli.Cluster2
## SGSM1 Gli.Cluster2
## SUGT1P3 Gli.Cluster2
## WNT7B Gli.Cluster2
## AQP9 Gli.Cluster3
## B3GALNT2 Gli.Cluster3
## BIRC3 Gli.Cluster3
## CCDC69 Gli.Cluster3
## CPA2 Gli.Cluster3
## DENND2D Gli.Cluster3
## DHX15 Gli.Cluster3
## DUSP5 Gli.Cluster3
## FAM50B Gli.Cluster3
## FAM86B2 Gli.Cluster3
## FGR Gli.Cluster3
## G0S2 Gli.Cluster3
## GJB2 Gli.Cluster3
## PHKB Gli.Cluster3
## PPARG Gli.Cluster3
## PRPF38A Gli.Cluster3
## PTRF Gli.Cluster3
## RASSF10 Gli.Cluster3
## S100PBP Gli.Cluster3
## SHISA5 Gli.Cluster3
## VASN Gli.Cluster3
## ZMYM1 Gli.Cluster3
## ZNF175 Gli.Cluster3
## ZNF432 Gli.Cluster3
## ZNF543 Gli.Cluster3
## ZNF644 Gli.Cluster3
## ZRANB2 Gli.Cluster3
## CKAP4 Gli.Cluster4
## OPN3 Gli.Cluster4
## PTX3 Gli.Cluster4
## SLC16A3 Gli.Cluster4
## DAPK2 Gli.Cluster5
## PIP5KL1 Gli.Cluster5
gbm.50.clist <- lapply(sort(unique(gbm.50.clusters)), function(x) r50.glioblastoma[which(gbm.50.clusters==x),])
gbm.cluster.50.1 <- cbind(gbm.50.clist[[1]], Cluster = "Gbm.Cluster1")
gbm.cluster.50.2 <- cbind(gbm.50.clist[[2]], Cluster = "Gbm.Cluster2")
gbm.cluster.50.3 <- cbind(gbm.50.clist[[3]], Cluster = "Gbm.Cluster3")
gbm.cluster.50.4 <- cbind(gbm.50.clist[[4]], Cluster = "Gbm.Cluster4")
gbm.cluster.50.5 <- cbind(gbm.50.clist[[5]], Cluster = "Gbm.Cluster5")
#gbm.cluster.50.6 <- cbind(gbm.50.clist[[6]], Cluster = "Gbm.Cluster6")
#gbm.cluster.50.7 <- cbind(gbm.50.clist[[7]], Cluster = "Gbm.Cluster7")
#gbm.cluster.50.8 <- cbind(gbm.50.clist[[8]], Cluster = "Gbm.Cluster8")
glioblastoma_50_Genes_C <- rbind(gbm.cluster.50.1, gbm.cluster.50.2, gbm.cluster.50.3, gbm.cluster.50.4,
gbm.cluster.50.5) #gbm.cluster.50.6,gbm.cluster.50.7, gbm.cluster.50.8)
#save cluster table results into csv
#write.csv(glioblastoma_50_Genes_C, "Table_Glioblastoma_50genes_5Clust.csv")
glioblastoma_50_Genes_C[,-c(1,3:9)]
## GeneFunction
## SGSM1 Small G protein signaling modulator 1
## CUX1 Cut-like homeobox 1
## OPN3 Opsin 3
## BIRC3 Baculoviral IAP repeat-containing 3
## DENND2D DENN/MADD domain containing 2D
## ANO4 Anoctamin 4
## GJB2 Gap junction protein, beta 2, 26kDa
## DHX15 DEAH (Asp-Glu-Ala-His) box polypeptide 15
## PIP5KL1 Phosphatidylinositol-4-phosphate 5-kinase-like 1
## RIIAD1 Data not found
## ZMYM1 Zinc finger, MYM-type 1
## ZNF175 Zinc finger protein 175
## S100PBP S100P binding protein
## CPA2 Carboxypeptidase A2 (pancreatic)
## FAM86B2 Family with sequence similarity 86, member B2
## PHKB Phosphorylase kinase, beta
## PRPF38A PRP38 pre-mRNA processing factor 38 (yeast) domain containing A
## SCN9A Sodium channel, voltage-gated, type IX, alpha subunit
## EHD3 EH-domain containing 3
## SHISA5 Shisa homolog 5 (Xenopus laevis)
## DLK2 Delta-like 2 homolog (Drosophila)
## WNT7B Wingless-type MMTV integration site family, member 7B
## HMX1 H6 family homeobox 1
## ZNF543 Zinc finger protein 543
## CYP2E1 Cytochrome P450, family 2, subfamily E, polypeptide 1
## HECW1 HECT, C2 and WW domain containing E3 ubiquitin protein ligase 1
## DKFZP586I1420 Hypothetical protein DKFZp586I1420
## CCDC69 Coiled-coil domain containing 69
## ZNF644 Zinc finger protein 644
## DAPK2 Death-associated protein kinase 2
## ZRANB2 Zinc finger, RAN-binding domain containing 2
## GRHPR Glyoxylate reductase/hydroxypyruvate reductase
## AK7 Adenylate kinase 7
## RASSF10 Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## VASN Vasorin
## PPARG Peroxisome proliferator-activated receptor gamma
## CKAP4 Cytoskeleton-associated protein 4
## DUSP5 Dual specificity phosphatase 5
## FAM50B Family with sequence similarity 50, member B
## FGR Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## FSTL3 Follistatin-like 3 (secreted glycoprotein)
## PTX3 Pentraxin 3, long
## AQP9 Aquaporin 9
## SLC16A3 Solute carrier family 16, member 3 (monocarboxylic acid transporter 4)
## G0S2 G0/G1switch 2
## PTRF Polymerase I and transcript release factor
## ZNF432 Zinc finger protein 432
## SUGT1P3 Suppressor of G2 allele of SKP1 (S. cerevisiae) pseudogene 3
## B3GALNT2 Beta-1,3-N-acetylgalactosaminyltransferase 2
## ANK1 Ankyrin 1, erythrocytic
## Cluster
## SGSM1 Gbm.Cluster1
## CUX1 Gbm.Cluster1
## OPN3 Gbm.Cluster1
## BIRC3 Gbm.Cluster1
## DENND2D Gbm.Cluster1
## ANO4 Gbm.Cluster1
## GJB2 Gbm.Cluster1
## DHX15 Gbm.Cluster2
## PIP5KL1 Gbm.Cluster2
## RIIAD1 Gbm.Cluster2
## ZMYM1 Gbm.Cluster2
## ZNF175 Gbm.Cluster2
## S100PBP Gbm.Cluster2
## CPA2 Gbm.Cluster2
## FAM86B2 Gbm.Cluster2
## PHKB Gbm.Cluster2
## PRPF38A Gbm.Cluster2
## SCN9A Gbm.Cluster2
## EHD3 Gbm.Cluster2
## SHISA5 Gbm.Cluster2
## DLK2 Gbm.Cluster2
## WNT7B Gbm.Cluster2
## HMX1 Gbm.Cluster2
## ZNF543 Gbm.Cluster2
## CYP2E1 Gbm.Cluster2
## HECW1 Gbm.Cluster2
## DKFZP586I1420 Gbm.Cluster2
## CCDC69 Gbm.Cluster2
## ZNF644 Gbm.Cluster2
## DAPK2 Gbm.Cluster2
## ZRANB2 Gbm.Cluster2
## GRHPR Gbm.Cluster2
## AK7 Gbm.Cluster2
## RASSF10 Gbm.Cluster2
## VASN Gbm.Cluster3
## PPARG Gbm.Cluster3
## CKAP4 Gbm.Cluster3
## DUSP5 Gbm.Cluster3
## FAM50B Gbm.Cluster3
## FGR Gbm.Cluster3
## FSTL3 Gbm.Cluster3
## PTX3 Gbm.Cluster3
## AQP9 Gbm.Cluster3
## SLC16A3 Gbm.Cluster3
## G0S2 Gbm.Cluster3
## PTRF Gbm.Cluster3
## ZNF432 Gbm.Cluster4
## SUGT1P3 Gbm.Cluster4
## B3GALNT2 Gbm.Cluster4
## ANK1 Gbm.Cluster5
For markdownfile a previously generated random gene list was loaded
#generate 100 random genes that are significant in both gli and gbm
#grabs from glioblastoma & glioma dfs (contains same genes)
#r100.glioblastoma <- glioblastoma[sample(1:nrow(glioblastoma), 100, replace=FALSE),]
#r100.glioma <- glioma[glioma$GeneName %in% r100.glioblastoma$GeneName,]
#save randomly generated gene df into csv for reproducibility
#write.csv(r100.glioblastoma, "r100glioblastoma.csv")
#write.csv(r100.glioma, "r100glioma.csv")
r100.glioblastoma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r100glioblastoma.csv")
r100.glioma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r100glioma.csv")
row.names(r100.glioblastoma) <- r100.glioblastoma$GeneName
row.names(r100.glioma) <- r100.glioma$GeneName
par(opar)
nc.r100.glioblastoma <- r100.glioblastoma[3:9]
d.r100.glioblastoma <- dist(nc.r100.glioblastoma)
fit.gbm.ward100 <- hclust(d.r100.glioblastoma, method="ward")
plot(fit.gbm.ward100, hang=-1, cex=.8, main="100 Gene Ward Method Clustering: Glioblastoma")
nc.r100.glioma <- r100.glioma[3:9]
d.r100.glioma <- dist(nc.r100.glioma)
fit.gli.ward100 <- hclust(d.r100.glioma, method="ward")
plot(fit.gli.ward100, hang=-1, cex=.8, main="100 Gene Ward Method Clustering: Glioma")
par(opar)
#find best number of clusters for 100 genes hierarchial clustering
#find for 100 gli genes
nc.gli.100 <- NbClust(nc.r100.glioma, distance="euclidean",
min.nc=2, max.nc=20, method="ward.D")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 17 proposed 3 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 17 as the best number of clusters
## * 3 proposed 20 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
table(nc.gli.100$Best.n[1,])
##
## 0 1 2 3 10 17 20
## 2 1 1 17 1 1 3
barplot(table(nc.gli.100$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Ward Clusters Chosen by 7 Criteria: 100 GLI")
par(opar)
#find for 100 gbm genes
nc.gbm.100 <- NbClust(nc.r100.glioblastoma, distance="euclidean",
min.nc=2, max.nc=20, method="ward.D")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 16 as the best number of clusters
## * 2 proposed 17 as the best number of clusters
## * 1 proposed 18 as the best number of clusters
## * 2 proposed 20 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
table(nc.gbm.100$Best.n[1,])
##
## 0 1 2 3 4 5 6 7 16 17 18 20
## 2 1 6 2 5 1 1 2 1 2 1 2
barplot(table(nc.gbm.100$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Ward Clusters Chosen by 7 Criteria: 100 GBM")
par(opar)
#plot 100 gene glioma dendogram w/red boxes around 10 clusters
gli.100.clusters <- cutree(fit.gli.ward100, k=10)
table(gli.100.clusters)
## gli.100.clusters
## 1 2 3 4 5 6 7 8 9 10
## 19 11 18 7 6 12 6 5 8 8
aggregate(nc.r100.glioma, by=list(cluster=gli.100.clusters), median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox_coefficient
## 1 1 -0.33723915 0.4964010 0.6901913
## 2 2 -0.06897342 0.3002306 0.4129763
## 3 3 0.98065569 1.1576276 0.8579346
## 4 4 -0.38238700 -0.2284922 0.2786119
## 5 5 -0.37008446 -1.1458391 -0.9033706
## 6 6 -0.61686806 -1.5449666 -1.4505873
## 7 7 -1.49974619 0.4718393 0.8993754
## 8 8 1.25698724 0.1825930 0.3300947
## 9 9 0.16445403 -0.1657952 -1.4911795
## 10 10 0.41196540 0.6825527 0.8172010
## pvalue bh_pvalue Median_Expression Mean_Expression
## 1 -0.5436265 -0.5653951 -0.26781673 -0.26493921
## 2 0.1810099 0.3120433 -0.26704794 -0.28712766
## 3 -0.5748383 -0.6246490 -0.24344459 -0.26957030
## 4 2.3898414 2.3416708 -0.24959806 -0.27818242
## 5 0.6000767 0.7290179 -0.09670977 -0.08369619
## 6 -0.5779551 -0.6325875 0.02690056 0.12284067
## 7 -0.5736728 -0.6228476 -0.15754173 -0.18220353
## 8 2.0406190 2.0450908 -0.25866463 -0.27201661
## 9 -0.5780162 -0.6327654 -0.30060345 -0.32510579
## 10 -0.5757986 -0.6268340 0.67110837 0.65836299
aggregate(as.data.frame(nc.r100.glioma), by=list(cluster=gli.100.clusters),
median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox_coefficient
## 1 1 -0.33723915 0.4964010 0.6901913
## 2 2 -0.06897342 0.3002306 0.4129763
## 3 3 0.98065569 1.1576276 0.8579346
## 4 4 -0.38238700 -0.2284922 0.2786119
## 5 5 -0.37008446 -1.1458391 -0.9033706
## 6 6 -0.61686806 -1.5449666 -1.4505873
## 7 7 -1.49974619 0.4718393 0.8993754
## 8 8 1.25698724 0.1825930 0.3300947
## 9 9 0.16445403 -0.1657952 -1.4911795
## 10 10 0.41196540 0.6825527 0.8172010
## pvalue bh_pvalue Median_Expression Mean_Expression
## 1 -0.5436265 -0.5653951 -0.26781673 -0.26493921
## 2 0.1810099 0.3120433 -0.26704794 -0.28712766
## 3 -0.5748383 -0.6246490 -0.24344459 -0.26957030
## 4 2.3898414 2.3416708 -0.24959806 -0.27818242
## 5 0.6000767 0.7290179 -0.09670977 -0.08369619
## 6 -0.5779551 -0.6325875 0.02690056 0.12284067
## 7 -0.5736728 -0.6228476 -0.15754173 -0.18220353
## 8 2.0406190 2.0450908 -0.25866463 -0.27201661
## 9 -0.5780162 -0.6327654 -0.30060345 -0.32510579
## 10 -0.5757986 -0.6268340 0.67110837 0.65836299
plot(fit.gli.ward100, hang=-1, cex=.8,
main="Glioma 100 Gene Ward Method Clustering\n10 Cluster Solution")
#outline k clusters in red boxes on dendogram
rect.hclust(fit.gli.ward100, k=10)
par(opar)
#plot 100 gene glioblastoma dendogram w/redboxes around 10 clusters
gbm.100.clusters <- cutree(fit.gbm.ward100, k=10)
table(gbm.100.clusters)
## gbm.100.clusters
## 1 2 3 4 5 6 7 8 9 10
## 20 13 7 14 7 7 11 9 7 5
aggregate(nc.r100.glioblastoma, by=list(cluster=gbm.100.clusters), median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox.coefficient
## 1 1 0.44741374 -0.4007840 -1.3999022
## 2 2 -0.01287706 0.1406558 0.7087541
## 3 3 0.86335265 1.5668525 0.8096134
## 4 4 -0.02079043 -0.2402117 0.5475921
## 5 5 -0.63228624 0.6921090 0.4590257
## 6 6 0.03562454 -0.4529968 0.9667099
## 7 7 -0.99496397 -0.7083389 0.5437204
## 8 8 1.13109766 1.1448585 0.5258135
## 9 9 -0.66394158 -0.8184865 -1.8190200
## 10 10 -0.64561638 -1.9292606 -1.4371679
## pvalue bh_pvalue Median.Expression Mean.Expression
## 1 0.45827358 0.4161397 -0.21235303 -0.20642396
## 2 -0.87923628 0.1080645 0.09196304 0.06829423
## 3 -1.09042205 -0.5376288 -0.15761511 -0.16384789
## 4 0.59906410 0.4737663 -0.24992118 -0.23029184
## 5 0.73985461 0.4835080 -0.19285941 -0.18598142
## 6 -1.60430742 -2.0615623 -0.22935145 -0.19490796
## 7 -0.03449321 0.4012574 -0.20762862 -0.20072240
## 8 0.38787833 0.4012574 -0.20532013 -0.18526386
## 9 -1.53391216 -1.5142691 -0.16881740 -0.17033129
## 10 0.66945935 0.4835080 0.10732303 0.04213735
aggregate(as.data.frame(nc.r100.glioblastoma), by=list(cluster=gbm.100.clusters),
median)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox.coefficient
## 1 1 0.44741374 -0.4007840 -1.3999022
## 2 2 -0.01287706 0.1406558 0.7087541
## 3 3 0.86335265 1.5668525 0.8096134
## 4 4 -0.02079043 -0.2402117 0.5475921
## 5 5 -0.63228624 0.6921090 0.4590257
## 6 6 0.03562454 -0.4529968 0.9667099
## 7 7 -0.99496397 -0.7083389 0.5437204
## 8 8 1.13109766 1.1448585 0.5258135
## 9 9 -0.66394158 -0.8184865 -1.8190200
## 10 10 -0.64561638 -1.9292606 -1.4371679
## pvalue bh_pvalue Median.Expression Mean.Expression
## 1 0.45827358 0.4161397 -0.21235303 -0.20642396
## 2 -0.87923628 0.1080645 0.09196304 0.06829423
## 3 -1.09042205 -0.5376288 -0.15761511 -0.16384789
## 4 0.59906410 0.4737663 -0.24992118 -0.23029184
## 5 0.73985461 0.4835080 -0.19285941 -0.18598142
## 6 -1.60430742 -2.0615623 -0.22935145 -0.19490796
## 7 -0.03449321 0.4012574 -0.20762862 -0.20072240
## 8 0.38787833 0.4012574 -0.20532013 -0.18526386
## 9 -1.53391216 -1.5142691 -0.16881740 -0.17033129
## 10 0.66945935 0.4835080 0.10732303 0.04213735
plot(fit.gbm.ward100, hang=-1, cex=.8,
main="Glioblastoma 100 Gene Ward Method Clustering\n10 Cluster Solution")
#outline k clusters in red boxes on dendogram
rect.hclust(fit.gbm.ward100, k=10)
#K-means cluster plot of 100 glioblastoma genes
km.100.glioblastoma <- kmeans(nc.r100.glioblastoma, 4, iter.max = 10, nstart=25)
km.100.glioblastoma$size
## [1] 25 13 41 21
km.100.glioblastoma$centers
## Unweighted_meta_Z_allcancers precog_Z Cox.coefficient pvalue
## 1 0.08758045 -0.78979091 -1.4192803 0.4948791
## 2 -0.44739658 -0.71538132 -0.5587032 -1.5204288
## 3 -0.32099130 -0.02533314 0.5321759 0.5132162
## 4 0.79368490 1.00398950 0.7440931 -0.8745433
## bh_pvalue Median.Expression Mean.Expression
## 1 0.4391912 0.01458218 -0.01341659
## 2 -1.7269062 -0.17738145 -0.17116306
## 3 0.4379154 -0.05137820 -0.05182371
## 4 -0.3718029 0.03146196 0.02840369
aggregate(nc.r100.glioblastoma, by=list(cluster=km.100.glioblastoma$cluster), mean)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox.coefficient
## 1 1 0.08758045 -0.78979091 -1.4192803
## 2 2 -0.44739658 -0.71538132 -0.5587032
## 3 3 -0.32099130 -0.02533314 0.5321759
## 4 4 0.79368490 1.00398950 0.7440931
## pvalue bh_pvalue Median.Expression Mean.Expression
## 1 0.4948791 0.4391912 0.01458218 -0.01341659
## 2 -1.5204288 -1.7269062 -0.17738145 -0.17116306
## 3 0.5132162 0.4379154 -0.05137820 -0.05182371
## 4 -0.8745433 -0.3718029 0.03146196 0.02840369
clusplot(nc.r100.glioblastoma, km.100.glioblastoma$cluster, color=TRUE, shade=TRUE,
labels=1, lines=0, main = "Glioblastoma 100 gene K-means Cluster Plot, Outliers Removed")
#Generate tables for ward clustering results
#100 gene glioma 10 cluster table
gli.100.clist <- lapply(sort(unique(gli.100.clusters)), function(x) r100.glioma[which(gli.100.clusters==x),])
gli.cluster.100.1 <- cbind(gli.100.clist[[1]], Cluster = "C1")
gli.cluster.100.2 <- cbind(gli.100.clist[[2]], Cluster = "C2")
gli.cluster.100.3 <- cbind(gli.100.clist[[3]], Cluster = "C3")
gli.cluster.100.4 <- cbind(gli.100.clist[[4]], Cluster = "C4")
gli.cluster.100.5 <- cbind(gli.100.clist[[5]], Cluster = "C5")
gli.cluster.100.6 <- cbind(gli.100.clist[[6]], Cluster = "C6")
gli.cluster.100.7 <- cbind(gli.100.clist[[7]], Cluster = "C7")
gli.cluster.100.8 <- cbind(gli.100.clist[[8]], Cluster = "C8")
gli.cluster.100.9 <- cbind(gli.100.clist[[9]], Cluster = "C9")
gli.cluster.100.10 <- cbind(gli.100.clist[[10]], Cluster = "C10")
glioma_100_Genes_C <- rbind(gli.cluster.100.1, gli.cluster.100.2, gli.cluster.100.3, gli.cluster.100.4,
gli.cluster.100.5,gli.cluster.100.6,gli.cluster.100.7, gli.cluster.100.8,
gli.cluster.100.9, gli.cluster.100.10)
glioma_100_Genes_C[,-c(1,3:9)]
## GeneFunction
## ABCB7 ATP-binding cassette, sub-family B (MDR/TAP), member 7
## ADPRH ADP-ribosylarginine hydrolase
## C2 Complement component 2
## CCNL2 Cyclin L2
## CPM Carboxypeptidase M
## DHH Desert hedgehog
## FGR Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## FKBP9 FK506 binding protein 9, 63 kDa
## HS6ST1 Heparan sulfate 6-O-sulfotransferase 1
## MALT1 Mucosa associated lymphoid tissue lymphoma translocation gene 1
## OSCAR Osteoclast associated, immunoglobulin-like receptor
## PHKB Phosphorylase kinase, beta
## RASSF10 Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## RRN3P2 RNA polymerase I transcription factor homolog (S. cerevisiae) pseudogene 2
## SNX10 Sorting nexin 10
## SULF1 Sulfatase 1
## ZNF235 In multiple clusters
## ZNF432 Zinc finger protein 432
## ZNF69 Zinc finger protein 69
## ADAMTS14 ADAM metallopeptidase with thrombospondin type 1 motif, 14
## CCDC114 Coiled-coil domain containing 114
## CD1D CD1d molecule
## CMPK1 Cytidine monophosphate (UMP-CMP) kinase 1, cytosolic
## KLF10 Kruppel-like factor 10
## LMO7 LIM domain 7
## PPM1J Protein phosphatase, Mg2+/Mn2+ dependent, 1J
## SARDH Sarcosine dehydrogenase
## TMTC1 Transmembrane and tetratricopeptide repeat containing 1
## TNIP1 TNFAIP3 interacting protein 1
## ZNF175 Zinc finger protein 175
## AEBP1 AE binding protein 1
## ARMC10 Armadillo repeat containing 10
## CDCP1 CUB domain containing protein 1
## CNPY4 Canopy 4 homolog (zebrafish)
## EPHB2 EPH receptor B2
## ERRFI1 ERBB receptor feedback inhibitor 1
## GALE UDP-galactose-4-epimerase
## HMOX1 Heme oxygenase (decycling) 1
## ISG20 Interferon stimulated exonuclease gene 20kDa
## LOXL1 Lysyl oxidase-like 1
## PODNL1 Podocan-like 1
## PTX3 Pentraxin 3, long
## SKP2 S-phase kinase-associated protein 2 (p45)
## SPAG4 Sperm associated antigen 4
## UBE2J2 Ubiquitin-conjugating enzyme E2, J2 (UBC6 homolog, yeast)
## ZNF320 Zinc finger protein 320
## ZNF480 Zinc finger protein 480
## ZNF765 Zinc finger protein 765
## ALOX15B Arachidonate 15-lipoxygenase, type B
## CHST1 Carbohydrate (keratan sulfate Gal-6) sulfotransferase 1
## FIGF C-fos induced growth factor (vascular endothelial growth factor D)
## KPNA5 Karyopherin alpha 5 (importin alpha 6)
## RCOR3 REST corepressor 3
## UBE2Z Ubiquitin-conjugating enzyme E2Z
## ZNF136 Zinc finger protein 136
## ANKH Ankylosis, progressive homolog (mouse)
## GUCA1B Guanylate cyclase activator 1B (retina)
## INPP5F Inositol polyphosphate-5-phosphatase F
## NRCAM Neuronal cell adhesion molecule
## SLC25A48 Solute carrier family 25, member 48
## ZBTB6 Zinc finger and BTB domain containing 6
## AP3B2 Adaptor-related protein complex 3, beta 2 subunit
## APBB1 Amyloid beta (A4) precursor protein-binding, family B, member 1 (Fe65)
## FAM149A Family with sequence similarity 149, member A
## GDI2 GDP dissociation inhibitor 2
## GRHPR Glyoxylate reductase/hydroxypyruvate reductase
## L1CAM L1 cell adhesion molecule
## NRXN2 Neurexin 2
## OLFM1 Olfactomedin 1
## RAB3C RAB3C, member RAS oncogene family
## RRAGB Ras-related GTP binding B
## TAF9B TAF9B RNA polymerase II, TATA box binding protein (TBP)-associated factor, 31kDa
## TTC9 Tetratricopeptide repeat domain 9
## CCBL2 Cysteine conjugate-beta lyase 2
## H6PD Hexose-6-phosphate dehydrogenase (glucose 1-dehydrogenase)
## MAN2B2 Mannosidase, alpha, class 2B, member 2
## MIER1 Mesoderm induction early response 1 homolog (Xenopus laevis)
## PNPLA4 Patatin-like phospholipase domain containing 4
## SLC9A1 Solute carrier family 9 (sodium/hydrogen exchanger), member 1
## EPS8L2 EPS8-like 2
## FSTL3 Follistatin-like 3 (secreted glycoprotein)
## NSUN5 NOP2/Sun domain family, member 5
## TSEN15 TRNA splicing endonuclease 15 homolog (S. cerevisiae)
## ZNF616 In multiple clusters
## FLJ23867 Hypothetical protein FLJ23867
## HIST3H2A Histone cluster 3, H2a
## HMX1 H6 family homeobox 1
## MCM3AP-AS1 MCM3AP antisense RNA 1 (non-protein coding)
## PNMA5 Paraneoplastic antigen like 5
## SCARB1 Data not found
## SLC9A2 Solute carrier family 9 (sodium/hydrogen exchanger), member 2
## VWC2L Von Willebrand factor C domain-containing protein 2-like
## GRN Granulin
## ID3 Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein
## ITGB5 Integrin, beta 5
## ORAI2 ORAI calcium release-activated calcium modulator 2
## PLD3 Phospholipase D family, member 3
## RBBP4 Retinoblastoma binding protein 4
## SERBP1 SERPINE1 mRNA binding protein 1
## SHISA5 Shisa homolog 5 (Xenopus laevis)
## Cluster
## ABCB7 C1
## ADPRH C1
## C2 C1
## CCNL2 C1
## CPM C1
## DHH C1
## FGR C1
## FKBP9 C1
## HS6ST1 C1
## MALT1 C1
## OSCAR C1
## PHKB C1
## RASSF10 C1
## RRN3P2 C1
## SNX10 C1
## SULF1 C1
## ZNF235 C1
## ZNF432 C1
## ZNF69 C1
## ADAMTS14 C2
## CCDC114 C2
## CD1D C2
## CMPK1 C2
## KLF10 C2
## LMO7 C2
## PPM1J C2
## SARDH C2
## TMTC1 C2
## TNIP1 C2
## ZNF175 C2
## AEBP1 C3
## ARMC10 C3
## CDCP1 C3
## CNPY4 C3
## EPHB2 C3
## ERRFI1 C3
## GALE C3
## HMOX1 C3
## ISG20 C3
## LOXL1 C3
## PODNL1 C3
## PTX3 C3
## SKP2 C3
## SPAG4 C3
## UBE2J2 C3
## ZNF320 C3
## ZNF480 C3
## ZNF765 C3
## ALOX15B C4
## CHST1 C4
## FIGF C4
## KPNA5 C4
## RCOR3 C4
## UBE2Z C4
## ZNF136 C4
## ANKH C5
## GUCA1B C5
## INPP5F C5
## NRCAM C5
## SLC25A48 C5
## ZBTB6 C5
## AP3B2 C6
## APBB1 C6
## FAM149A C6
## GDI2 C6
## GRHPR C6
## L1CAM C6
## NRXN2 C6
## OLFM1 C6
## RAB3C C6
## RRAGB C6
## TAF9B C6
## TTC9 C6
## CCBL2 C7
## H6PD C7
## MAN2B2 C7
## MIER1 C7
## PNPLA4 C7
## SLC9A1 C7
## EPS8L2 C8
## FSTL3 C8
## NSUN5 C8
## TSEN15 C8
## ZNF616 C8
## FLJ23867 C9
## HIST3H2A C9
## HMX1 C9
## MCM3AP-AS1 C9
## PNMA5 C9
## SCARB1 C9
## SLC9A2 C9
## VWC2L C9
## GRN C10
## ID3 C10
## ITGB5 C10
## ORAI2 C10
## PLD3 C10
## RBBP4 C10
## SERBP1 C10
## SHISA5 C10
#100 gene glioblastoma 10 cluster table
gbm.100.clist <- lapply(sort(unique(gbm.100.clusters)), function(x) r100.glioblastoma[which(gbm.100.clusters==x),])
gbm.cluster.100.1 <- cbind(gbm.100.clist[[1]], Cluster = "C1")
gbm.cluster.100.2 <- cbind(gbm.100.clist[[2]], Cluster = "C2")
gbm.cluster.100.3 <- cbind(gbm.100.clist[[3]], Cluster = "C3")
gbm.cluster.100.4 <- cbind(gbm.100.clist[[4]], Cluster = "C4")
gbm.cluster.100.5 <- cbind(gbm.100.clist[[5]], Cluster = "C5")
gbm.cluster.100.6 <- cbind(gbm.100.clist[[6]], Cluster = "C6")
gbm.cluster.100.7 <- cbind(gbm.100.clist[[7]], Cluster = "C7")
gbm.cluster.100.8 <- cbind(gbm.100.clist[[8]], Cluster = "C8")
gbm.cluster.100.9 <- cbind(gbm.100.clist[[9]], Cluster = "C9")
gbm.cluster.100.10 <- cbind(gbm.100.clist[[10]], Cluster = "C10")
glioblastoma_100_Genes_C <- rbind(gbm.cluster.100.1, gbm.cluster.100.2, gbm.cluster.100.3, gbm.cluster.100.4,
gbm.cluster.100.5,gbm.cluster.100.6,gbm.cluster.100.7, gbm.cluster.100.8,
gbm.cluster.100.9, gbm.cluster.100.10)
glioblastoma_100_Genes_C[,-c(1,3:9)]
## GeneFunction
## SKP2 S-phase kinase-associated protein 2 (p45)
## ZNF616 In multiple clusters
## TAF9B TAF9B RNA polymerase II, TATA box binding protein (TBP)-associated factor, 31kDa
## GUCA1B Guanylate cyclase activator 1B (retina)
## SERBP1 SERPINE1 mRNA binding protein 1
## SLC25A48 Solute carrier family 25, member 48
## TSEN15 TRNA splicing endonuclease 15 homolog (S. cerevisiae)
## ZNF480 Zinc finger protein 480
## ZNF320 Zinc finger protein 320
## ZNF175 Zinc finger protein 175
## ID3 Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein
## CMPK1 Cytidine monophosphate (UMP-CMP) kinase 1, cytosolic
## ZNF69 Zinc finger protein 69
## RBBP4 Retinoblastoma binding protein 4
## ZNF235 In multiple clusters
## CCNL2 Cyclin L2
## UBE2J2 Ubiquitin-conjugating enzyme E2, J2 (UBC6 homolog, yeast)
## HIST3H2A Histone cluster 3, H2a
## ZNF765 Zinc finger protein 765
## MCM3AP-AS1 MCM3AP antisense RNA 1 (non-protein coding)
## PLD3 Phospholipase D family, member 3
## PPM1J Protein phosphatase, Mg2+/Mn2+ dependent, 1J
## NRCAM Neuronal cell adhesion molecule
## KLF10 Kruppel-like factor 10
## AEBP1 AE binding protein 1
## ARMC10 Armadillo repeat containing 10
## AP3B2 Adaptor-related protein complex 3, beta 2 subunit
## SCARB1 Data not found
## OLFM1 Olfactomedin 1
## GRN Granulin
## ORAI2 ORAI calcium release-activated calcium modulator 2
## FKBP9 FK506 binding protein 9, 63 kDa
## GALE UDP-galactose-4-epimerase
## ITGB5 Integrin, beta 5
## CNPY4 Canopy 4 homolog (zebrafish)
## ISG20 Interferon stimulated exonuclease gene 20kDa
## NSUN5 NOP2/Sun domain family, member 5
## ERRFI1 ERBB receptor feedback inhibitor 1
## SARDH Sarcosine dehydrogenase
## PODNL1 Podocan-like 1
## ADAMTS14 ADAM metallopeptidase with thrombospondin type 1 motif, 14
## APBB1 Amyloid beta (A4) precursor protein-binding, family B, member 1 (Fe65)
## OSCAR Osteoclast associated, immunoglobulin-like receptor
## CCDC114 Coiled-coil domain containing 114
## EPHB2 EPH receptor B2
## HS6ST1 Heparan sulfate 6-O-sulfotransferase 1
## VWC2L Von Willebrand factor C domain-containing protein 2-like
## PNMA5 Paraneoplastic antigen like 5
## CHST1 Carbohydrate (keratan sulfate Gal-6) sulfotransferase 1
## SHISA5 Shisa homolog 5 (Xenopus laevis)
## FIGF C-fos induced growth factor (vascular endothelial growth factor D)
## UBE2Z Ubiquitin-conjugating enzyme E2Z
## RASSF10 Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## SLC9A2 Solute carrier family 9 (sodium/hydrogen exchanger), member 2
## C2 Complement component 2
## FGR Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## MAN2B2 Mannosidase, alpha, class 2B, member 2
## MALT1 Mucosa associated lymphoid tissue lymphoma translocation gene 1
## CPM Carboxypeptidase M
## FAM149A Family with sequence similarity 149, member A
## PNPLA4 Patatin-like phospholipase domain containing 4
## RAB3C RAB3C, member RAS oncogene family
## L1CAM L1 cell adhesion molecule
## TMTC1 Transmembrane and tetratricopeptide repeat containing 1
## LMO7 LIM domain 7
## ANKH Ankylosis, progressive homolog (mouse)
## FLJ23867 Hypothetical protein FLJ23867
## LOXL1 Lysyl oxidase-like 1
## HMX1 H6 family homeobox 1
## INPP5F Inositol polyphosphate-5-phosphatase F
## H6PD Hexose-6-phosphate dehydrogenase (glucose 1-dehydrogenase)
## NRXN2 Neurexin 2
## CD1D CD1d molecule
## TTC9 Tetratricopeptide repeat domain 9
## SULF1 Sulfatase 1
## DHH Desert hedgehog
## ADPRH ADP-ribosylarginine hydrolase
## SLC9A1 Solute carrier family 9 (sodium/hydrogen exchanger), member 1
## ALOX15B Arachidonate 15-lipoxygenase, type B
## PTX3 Pentraxin 3, long
## CDCP1 CUB domain containing protein 1
## SNX10 Sorting nexin 10
## FSTL3 Follistatin-like 3 (secreted glycoprotein)
## SPAG4 Sperm associated antigen 4
## HMOX1 Heme oxygenase (decycling) 1
## TNIP1 TNFAIP3 interacting protein 1
## RRN3P2 RNA polymerase I transcription factor homolog (S. cerevisiae) pseudogene 2
## EPS8L2 EPS8-like 2
## MIER1 Mesoderm induction early response 1 homolog (Xenopus laevis)
## ZNF136 Zinc finger protein 136
## ZBTB6 Zinc finger and BTB domain containing 6
## CCBL2 Cysteine conjugate-beta lyase 2
## ABCB7 ATP-binding cassette, sub-family B (MDR/TAP), member 7
## ZNF432 Zinc finger protein 432
## RCOR3 REST corepressor 3
## GRHPR Glyoxylate reductase/hydroxypyruvate reductase
## GDI2 GDP dissociation inhibitor 2
## RRAGB Ras-related GTP binding B
## KPNA5 Karyopherin alpha 5 (importin alpha 6)
## PHKB Phosphorylase kinase, beta
## Cluster
## SKP2 C1
## ZNF616 C1
## TAF9B C1
## GUCA1B C1
## SERBP1 C1
## SLC25A48 C1
## TSEN15 C1
## ZNF480 C1
## ZNF320 C1
## ZNF175 C1
## ID3 C1
## CMPK1 C1
## ZNF69 C1
## RBBP4 C1
## ZNF235 C1
## CCNL2 C1
## UBE2J2 C1
## HIST3H2A C1
## ZNF765 C1
## MCM3AP-AS1 C1
## PLD3 C2
## PPM1J C2
## NRCAM C2
## KLF10 C2
## AEBP1 C2
## ARMC10 C2
## AP3B2 C2
## SCARB1 C2
## OLFM1 C2
## GRN C2
## ORAI2 C2
## FKBP9 C2
## GALE C2
## ITGB5 C3
## CNPY4 C3
## ISG20 C3
## NSUN5 C3
## ERRFI1 C3
## SARDH C3
## PODNL1 C3
## ADAMTS14 C4
## APBB1 C4
## OSCAR C4
## CCDC114 C4
## EPHB2 C4
## HS6ST1 C4
## VWC2L C4
## PNMA5 C4
## CHST1 C4
## SHISA5 C4
## FIGF C4
## UBE2Z C4
## RASSF10 C4
## SLC9A2 C4
## C2 C5
## FGR C5
## MAN2B2 C5
## MALT1 C5
## CPM C5
## FAM149A C5
## PNPLA4 C5
## RAB3C C6
## L1CAM C6
## TMTC1 C6
## LMO7 C6
## ANKH C6
## FLJ23867 C6
## LOXL1 C6
## HMX1 C7
## INPP5F C7
## H6PD C7
## NRXN2 C7
## CD1D C7
## TTC9 C7
## SULF1 C7
## DHH C7
## ADPRH C7
## SLC9A1 C7
## ALOX15B C7
## PTX3 C8
## CDCP1 C8
## SNX10 C8
## FSTL3 C8
## SPAG4 C8
## HMOX1 C8
## TNIP1 C8
## RRN3P2 C8
## EPS8L2 C8
## MIER1 C9
## ZNF136 C9
## ZBTB6 C9
## CCBL2 C9
## ABCB7 C9
## ZNF432 C9
## RCOR3 C9
## GRHPR C10
## GDI2 C10
## RRAGB C10
## KPNA5 C10
## PHKB C10
#save cluster table results to csvs
write.csv(glioblastoma_100_Genes_C, "Table_Glioblastoma_100genes_10WClust.csv")
write.csv(glioma_100_Genes_C, "Table_Glioma_100genes_10WClust.csv")
library(ggplot2)
library(ggfortify)
#Create wssplot function to help determine ideal number of clusters based on data
wssplot <- function(data, nc=20, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
#K-means clustering of glioblastoma dataset
head(glioblastoma)
## GeneName GeneFunction
## 39 ABCB7 ATP-binding cassette, sub-family B (MDR/TAP), member 7
## 62 ABCG2 ATP-binding cassette, sub-family G (WHITE), member 2
## 105 ACADM Acyl-CoA dehydrogenase, C-4 to C-12 straight chain
## 110 ACAP1 ArfGAP with coiled-coil, ankyrin repeat and PH domains 1
## 138 ACOT7 Acyl-CoA thioesterase 7
## 174 ACTL6A Actin-like 6A
## Unweighted_meta_Z_allcancers precog_Z Cox coefficient pvalue
## 39 0.4043703 -1.1253262 -1.8190200 -1.4916750
## 62 -2.5096316 -2.0565740 -1.6583421 -1.0904220
## 105 -1.2344034 -0.6418212 -1.3821405 0.5990641
## 110 -0.8348805 0.2043125 0.5984089 -0.3160742
## 138 1.2878316 -0.9980128 0.8907267 -1.6113469
## 174 2.9892534 0.6034188 -1.4062422 1.0918309
## bh_pvalue Median Expression Mean Expression
## 39 -1.2726904 -0.16881740 -0.17033129
## 62 -0.5376288 -0.09576952 -0.09488017
## 105 0.4737663 -0.01578161 -0.04205819
## 110 0.4012574 -0.26797451 -0.25152557
## 138 -2.2640969 -0.10571474 -0.09930900
## 174 0.4871126 -0.05723424 -0.06414104
row.names(glioblastoma) <- glioblastoma$GeneName
df.km.glioblastoma <- glioblastoma[-c(1:2)]
colnames(df.km.glioblastoma)<- c("unweighted_meta_z_all", "precog_Z", "cox_coefficient","pvalue", "bh_pvalue", "median_expression", "mean_expresson")
wssplot(df.km.glioblastoma)
library(NbClust)
nc.km.glioblastoma <- NbClust(df.km.glioblastoma, min.nc=2, max.nc=25, method="kmeans")
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 9 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
## * 1 proposed 18 as the best number of clusters
## * 2 proposed 25 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
par(opar)
table(nc.km.glioblastoma$Best.n[1,])
##
## 0 2 3 4 5 8 15 18 25
## 2 7 2 9 1 1 1 1 2
barplot(table(nc.km.glioblastoma$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Clusters of 512 glioblastoma Genes Chosen by 7 Criteria")
library(cluster)
fit.km.glioblastoma <- kmeans(df.km.glioblastoma, 4, nstart=25)
fit.km.glioblastoma$size
## [1] 102 266 2 142
fit.km.glioblastoma$centers
## unweighted_meta_z_all precog_Z cox_coefficient pvalue bh_pvalue
## 1 -0.008409492 0.2306924 0.6451177 -1.4233663 -1.5402096
## 2 -0.061382418 0.1910698 0.5391439 0.4569504 0.4370407
## 3 0.271743538 0.8033295 -0.3559794 0.5990641 0.4752023
## 4 0.117197214 -0.5349426 -1.4683263 0.1580031 0.2809728
## median_expression mean_expresson
## 1 -0.08335195 -0.08183629
## 2 -0.03740194 -0.03586408
## 3 13.40534252 14.16909386
## 4 -0.05887231 -0.07359887
aggregate(glioblastoma[-c(1:2)], by=list(cluster=fit.km.glioblastoma$cluster), mean)
## cluster Unweighted_meta_Z_allcancers precog_Z Cox coefficient
## 1 1 -0.008409492 0.2306924 0.6451177
## 2 2 -0.061382418 0.1910698 0.5391439
## 3 3 0.271743538 0.8033295 -0.3559794
## 4 4 0.117197214 -0.5349426 -1.4683263
## pvalue bh_pvalue Median Expression Mean Expression
## 1 -1.4233663 -1.5402096 -0.08335195 -0.08183629
## 2 0.4569504 0.4370407 -0.03740194 -0.03586408
## 3 0.5990641 0.4752023 13.40534252 14.16909386
## 4 0.1580031 0.2809728 -0.05887231 -0.07359887
clusplot(df.km.glioblastoma, fit.km.glioblastoma$cluster, color=TRUE, shade=TRUE,
labels=1, lines=0, main = "Glioblastoma K-means Cluster Plot")
#w/ ggplot
autoplot(kmeans(df.km.glioblastoma, 4), data = df.km.glioblastoma, frame = TRUE, main = "K-means Clustering of Glioblastoma Genes")
df.km <- fortify(kmeans(df.km.glioblastoma, 4), data = df.km.glioblastoma)
ggplot(df.km, aes(x= cluster, fill = cluster)) + geom_bar()
#K-means affected by outlier, check if PAM method better
set.seed(1234)
fit.pam.glioblastoma <- pam(df.km.glioblastoma, k = 4, stand=TRUE)
fit.pam.glioblastoma$medoids
## unweighted_meta_z_all precog_Z cox_coefficient pvalue
## RPL7L1 0.1113523 -0.27490099 -1.4120014 0.03590205
## CHST1 -0.1604180 0.16282839 0.5388807 0.24708782
## NRSN1 -0.1626039 -0.07105643 0.8844351 -1.40720070
## EEF1A1 -1.6740676 -0.81347979 -1.3181113 1.51420242
## bh_pvalue median_expression mean_expresson
## RPL7L1 0.4012574 -0.08261977 -0.1073970
## CHST1 0.4012574 -0.18954282 -0.1714919
## NRSN1 -1.1270882 -0.23836226 -0.2102697
## EEF1A1 0.5491473 16.09522707 14.6013264
clusplot(fit.pam.glioblastoma,color=TRUE, shade=TRUE,
labels=1, lines=0, main = "Glioblastoma PAM Cluster Plot")
#w/ggplot
autoplot(pam(df.km.glioblastoma, 4), data = df.km.glioblastoma, frame = TRUE, main = "PAM Clustering of Glioblastoma Genes")
df.pam <- fortify(pam(df.km.glioblastoma, 4), data = df.km.glioblastoma)
ggplot(df.pam, aes(x= cluster, fill = cluster)) + geom_bar()
#Pam method isn't better, need to remove outliers
Removing the outliers greatly improved the clustering results. However, outliers might still have valuable information and should be evaluated further.
#Need to remove outliers
library(outliers)
##
## Attaching package: 'outliers'
## The following object is masked from 'package:psych':
##
## outlier
outlier(df.km.glioblastoma)
## unweighted_meta_z_all precog_Z cox_coefficient
## 4.050315 -3.311827 -2.306378
## pvalue bh_pvalue median_expression
## -1.794192 -9.757258 16.095227
## mean_expresson
## 14.601326
outlier(df.km.glioblastoma, opposite = TRUE)
## unweighted_meta_z_all precog_Z cox_coefficient
## -3.4701961 2.5853601 2.0745167
## pvalue bh_pvalue median_expression
## 1.6549929 0.6062377 -0.2742224
## mean_expresson
## -0.2580407
rm.gbm.outliers <- c("EEF1A1", "SPP1", "PTPRN", "ZIC3", "PTPRN2", "CCT3","ENO1", "CTSB", "MTHFD2", "FUCA1",
"GNB2L1", "GDI2", "APLP2", "HSP90B1", "IGFBP2","NRCAM", "AEBP1", "HTRA1", "GSN")
tno.pam.glioblastoma<- df.km.glioblastoma[!rownames(df.km.glioblastoma) %in% rm.gbm.outliers,]
no.pam.glioblastoma<-scale(tno.pam.glioblastoma)*10
wssplot(no.pam.glioblastoma)
nbo.km.glioblastoma <- NbClust(tno.pam.glioblastoma, min.nc=2, max.nc=25, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 16 as the best number of clusters
## * 1 proposed 18 as the best number of clusters
## * 1 proposed 21 as the best number of clusters
## * 1 proposed 22 as the best number of clusters
## * 2 proposed 25 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
par(opar)
table(nbo.km.glioblastoma$Best.n[1,])
##
## 0 1 2 3 4 6 16 18 21 22 25
## 2 1 4 10 1 2 1 1 1 1 2
barplot(table(nbo.km.glioblastoma$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Clusters of GBM Genes Chosen by 7 Criteria with Outliers Removed")
#K means w/o outliers
no.km.glioblastoma <- kmeans(no.pam.glioblastoma, 3, iter.max = 16, nstart=25)
no.km.glioblastoma$size
## [1] 52 109 332
no.km.glioblastoma$centers
## unweighted_meta_z_all precog_Z cox_coefficient pvalue bh_pvalue
## 1 4.2827092 0.9371114 0.2617886 2.072610 2.608026
## 2 -0.1383413 0.6537113 2.7556122 -14.039213 -15.783358
## 3 -0.6253665 -0.3613986 -0.9457071 4.284634 4.773399
## median_expression mean_expresson
## 1 23.610877 23.983115
## 2 -1.860956 -1.931962
## 3 -3.087113 -3.122103
aggregate(no.pam.glioblastoma, by=list(cluster=no.km.glioblastoma$cluster), mean)
## cluster unweighted_meta_z_all precog_Z cox_coefficient pvalue
## 1 1 4.2827092 0.9371114 0.2617886 2.072610
## 2 2 -0.1383413 0.6537113 2.7556122 -14.039213
## 3 3 -0.6253665 -0.3613986 -0.9457071 4.284634
## bh_pvalue median_expression mean_expresson
## 1 2.608026 23.610877 23.983115
## 2 -15.783358 -1.860956 -1.931962
## 3 4.773399 -3.087113 -3.122103
str(no.km.glioblastoma)
## List of 9
## $ cluster : Named int [1:493] 2 2 3 3 2 3 3 3 3 3 ...
## ..- attr(*, "names")= chr [1:493] "ABCB7" "ABCG2" "ACADM" "ACAP1" ...
## $ centers : num [1:3, 1:7] 4.283 -0.138 -0.625 0.937 0.654 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:7] "unweighted_meta_z_all" "precog_Z" "cox_coefficient" "pvalue" ...
## $ totss : num 344400
## $ withinss : num [1:3] 36933 54762 121399
## $ tot.withinss: num 213093
## $ betweenss : num 131307
## $ size : int [1:3] 52 109 332
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
clusplot(no.pam.glioblastoma, no.km.glioblastoma$cluster, color=TRUE, shade=TRUE,
labels=1, lines=0, main = "Glioblastoma K-means Cluster Plot, Outliers Removed")
#W/ ggplot
autoplot(kmeans(tno.pam.glioblastoma, 4), data =tno.pam.glioblastoma, frame = TRUE, main = "K-means Clustering of Glioblastoma Genes")
no.df.km <- fortify(kmeans(tno.pam.glioblastoma, 4), data = tno.pam.glioblastoma)
ggplot(no.df.km, aes(x= cluster, fill = cluster)) + geom_bar()
#PAM plot w/o outliers
gg.pam.glioblastoma<-tno.pam.glioblastoma
colnames(gg.pam.glioblastoma)<- c("unweighted_meta_z_all", "precog_Z", "cox_coefficient","pvalue", "bh_pvalue", "median_expression", "mean_expresson")
autoplot(pam(gg.pam.glioblastoma, 3), data = gg.pam.glioblastoma, frame = TRUE, main = "PAM Clustering of Glioblastoma Genes, Outliers Removed")
no.df.km <- fortify(pam(no.pam.glioblastoma, 3), data = no.pam.glioblastoma)
ggplot(no.df.km, aes(x= cluster, fill = cluster)) + geom_bar()
##Visualizing Distance Matrices (STHDA)
A simple solution for visualizing the distance matrices is to use the function corrplot() [in corrplot package]. Here I continued my exploration of clustering analysis visualization by applying the 50 glioblastoma and glioma data to visual the eculidan distances clustering with heat maps.
library("corrplot")
#Use corrplot to view distance matrices
#Glioblastoma
d.r50.gbm.eucl<- dist(nc.r50.glioblastoma, method = "euclidean")
corrplot(as.matrix(d.r50.gbm.eucl), is.corr = FALSE, method = "color")
par(opar)
#Glioma
d.r50.gli.eucl<- dist(nc.r50.glioma, method = "euclidean")
corrplot(as.matrix(d.r50.gli.eucl), is.corr = FALSE, method = "color")
par(opar)
#Use heatmap w/ dendogram
#Glioblastoma
heatmap(as.matrix(d.r50.gbm.eucl), symm = TRUE, distfun = function(x) as.dist(x))
par(opar)
#Glioma
heatmap(as.matrix(d.r50.gli.eucl), symm = TRUE, distfun = function(x) as.dist(x))
par(opar)
I explored the factoextra clustering package. The function get_dist() is used for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function used in previous methods, get_dist() supports correlation-based distance measures including “pearson”, “kendall”, and “spearman”. The function fviz_dist() creats a heatmap like plot for visualizing the distance matrix.
I used the 50 glioma gene dataset to see how different correlation-based distance measured varried over the same data.
library(factoextra)
#50 Glioma genes using Pearson Method
p.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "pearson")
fviz_dist(p.res.dist,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
par(opar)
#50 Glioma genes using Kendall Method
k.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "kendall")
fviz_dist(k.res.dist,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
par(opar)
#50 Glioma genes using Spearman Method
s.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "spearman")
fviz_dist(s.res.dist,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
par(opar)
The factoextra package also has a function like nbclust to help determine the optimal number of clusters for a data set. I’ve tested in on the 50 glioma gene dataset below. Unlike the nbclust function used earlier, the fviz_nbclust function suggested 4 clusters instead of 3.
fviz_nbclust(nc.r50.glioma, kmeans, method = "gap_stat")
km.res <- kmeans(nc.r50.glioma, 4, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = nc.r50.glioma, geom = "point", show.clust.cent = TRUE, ellipse.type = "convex")+
theme_minimal()